The aim of this project is to cluster houses located in California, based on the following attributes:
longitude: A measure of how far west a house is
latitude: A measure of how far north a house is
housingMedianAge: Median age of a house within a block; a lower number is a newer building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households in each block
medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US Dollars)
oceanProximity: Location of the house with respect to the ocean
In the first section, we will implement Principal Component Analysis to reduce the dimensionality of the data, by identifying the main underlying components which explain the majority of the variability.
Next, we will use the retained components to perform both hierarchical and non-hierarchical clustering. This will allow us to group houses according to the most important uncorrelated criteria.
We can begin by visualizing the data.
To get an idea of the distribution of the variables we plot the histograms.
Next, since the dataset includes spatial information regarding longitude and latitude, we also plot the observations on a map, coloring them by our categorical variable “ocean proximity”.
For the purposes of PCA we exclude the categorical variable “ocean proximity” from our dataset. We will include it again when performing hierarchical clustering.
In order to proceed with PCA, we need to scale the variables, otherwise the process would be biased by their different units of measurement. We can check that the scaling was successful by verifying that, for each variable, the mean is equal to 0 and the standard deviation is equal to 1.
##                    means sds
## longitude              0   1
## latitude               0   1
## housing_median_age     0   1
## total_rooms            0   1
## total_bedrooms         0   1
## population             0   1
## households             0   1
## median_income          0   1
## median_house_value     0   1
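The scaling check described above can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `standardize` and the sample values are hypothetical and are not taken from the dataset.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a numeric column to mean 0 and scale it to sample standard deviation 1."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in column]

# Hypothetical values for two variables measured on very different scales.
data = {
    "median_income": [8.3, 7.2, 5.6, 3.8, 2.1],
    "median_house_value": [452600.0, 358500.0, 341300.0, 226700.0, 140000.0],
}

scaled = {name: standardize(values) for name, values in data.items()}

# After scaling, every column should have mean ~0 and standard deviation ~1.
for name, values in scaled.items():
    print(name, round(mean(values), 6), round(stdev(values), 6))
```

After this transformation both variables contribute on the same scale, regardless of their original units.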
##                          Comp1      Comp2      Comp3      Comp4
## longitude            0.1510000  0.9180000  0.3230000  0.0380000
## latitude            -0.1500000 -0.9570000 -0.1660000  0.0820000
## housing_median_age  -0.4280000 -0.0010000 -0.0650000 -0.8890000
## total_rooms          0.9580000 -0.0840000 -0.1110000 -0.0280000
## total_bedrooms       0.9680000 -0.1000000  0.0540000 -0.1180000
## population           0.9300000 -0.0640000  0.1020000 -0.1150000
## households           0.9710000 -0.1000000  0.0340000 -0.1380000
## median_income        0.1100000  0.2470000 -0.8740000  0.2000000
## median_house_value   0.0890000  0.2670000 -0.8770000 -0.1580000
## % of VAR explained   0.4347663  0.2135917  0.1885702  0.1011284
The communality of each variable is the sum of the squared values in its row of the component matrix.
##                      Comp1  Comp2  Comp3  Comp4 communality
## longitude            0.151  0.918  0.323  0.038    0.971298
## latitude            -0.150 -0.957 -0.166  0.082    0.972629
## housing_median_age  -0.428 -0.001 -0.065 -0.889    0.977731
## total_rooms          0.958 -0.084 -0.111 -0.028    0.937925
## total_bedrooms       0.968 -0.100  0.054 -0.118    0.963864
## population           0.930 -0.064  0.102 -0.115    0.892625
## households           0.971 -0.100  0.034 -0.138    0.973041
## median_income        0.110  0.247 -0.874  0.200    0.876985
## median_house_value   0.089  0.267 -0.877 -0.158    0.873303
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.9781042 1.3864795 1.3027401 0.9540206 0.5413259
## Proportion of Variance 0.4347663 0.2135917 0.1885702 0.1011284 0.0325593
## Cumulative Proportion  0.4347663 0.6483580 0.8369282 0.9380566 0.9706159
##                            Comp.6      Comp.7      Comp.8      Comp.9
## Standard deviation     0.37788200 0.249770513 0.210915373 0.121621662
## Proportion of Variance 0.01586609 0.006931701 0.004942811 0.001643537
## Cumulative Proportion  0.98648195 0.993413653 0.998356463 1.000000000
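The proportion of variance and the cumulative proportion follow directly from the standard deviations of the components. As a minimal sketch, using the standard deviations printed above (variable names are illustrative):

```python
# Standard deviations of the nine components, copied from the summary above.
sds = [1.9781042, 1.3864795, 1.3027401, 0.9540206, 0.5413259,
       0.37788200, 0.249770513, 0.210915373, 0.121621662]

# The variance of each component is the square of its standard deviation.
variances = [s ** 2 for s in sds]
total = sum(variances)

# Share of the total variance explained by each component.
proportions = [v / total for v in variances]

# Running sum of the shares.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Comp.{i}: proportion={p:.4f} cumulative={c:.4f}")
```

The third entry of `cumulative` is the share of variance retained when only the first three components are kept.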
This plot shows how the observations and the component vectors are positioned in the 4D component space: to obtain a readable representation, we plot the first three components in a 3-dimensional space, while the fourth is depicted as the intensity of the two extreme colors shown in the legend.
This graph can be interpreted as follows:
From this plot we can notice the presence of three possible clusters: two are placed at the extremes of the cloud of points and the third is smaller and placed between the two others.
We will now proceed with non-hierarchical clustering, specifically the k-means method.
## [1] 0.8369282
The first three components retain 83.69% of the total variance, so we use these three components for the k-means clustering analysis.
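To make the k-means step concrete, here is a minimal pure-Python sketch of Lloyd's algorithm on three-dimensional points. It is not the routine used for the analysis: the function name `kmeans`, the toy points and the fixed starting centroids are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm, started from the given initial centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical scores on three components for six observations.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
          (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.0, 5.1)]

labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

In practice the starting centroids are usually chosen at random and the algorithm is restarted several times, because the result depends on the initialization.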
The Calinski-Harabasz (CH) index reaches its peak when the number of clusters is 4, so we choose 4 as the number of clusters for k-means.
##
## 1 2 3 4
## 7067 2970 8865 1531
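The Calinski-Harabasz index compares the between-cluster dispersion B with the within-cluster dispersion W, as CH = [B / (k - 1)] / [W / (n - k)] for n observations and k clusters. The following sketch computes it for a given partition; the function names and the toy data are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(points, labels):
    """CH index: between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

ch_good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])  # matches the groups
ch_bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(ch_good > ch_bad)
```

Evaluating the index for several values of k and picking the maximum is the selection rule applied above.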
We can visualize the boxplots for the distribution of the 4 components retained in PCA across the 4 identified clusters.
Difference:
Combined with the interpretation of PCA:
Cluster1: Component 2 correlates most strongly with longitude and latitude: the higher component 2 is, the higher the longitude and the lower the latitude. The samples in cluster 1 show strongly negative values of component 2, meaning that they have lower longitude and higher latitude; as the map of California shows, they are the houses in the northwest.
Cluster2: This cluster shows negative values of component 3. Component 3 correlates most strongly with median income and median house value: the higher component 3 is, the lower the median income and the median house value. Houses in cluster 2 therefore have higher median income and median house value. On the map, cluster 2 appears along the coastline, meaning that houses near the coast have higher values.
Cluster3: In contrast to cluster 2, cluster 3 has positive values of component 3, so houses in cluster 3 have lower median income and median house value. Since most of them also have positive values of component 2, they appear mostly in the southeast of the map.
Cluster4: This cluster shows high positive values of component 1, which correlates most strongly and positively with total_rooms, total_bedrooms, population and households. Samples in cluster 4 are therefore houses in highly populous blocks.
In summary:
We are now going to implement hierarchical clustering using Gower’s distance, which makes it possible to compute the dissimilarity between observations based on both numerical and categorical variables. In addition to the three components found with PCA (populousness, position, wealth), the following analysis also considers the variable ocean proximity, which expresses the distance from the ocean through five levels; in particular, it takes into account San Francisco Bay (NEAR BAY).
We compute Gower’s index to find the distances among the observations. As a technical sidenote, we also use the parallel package to speed up the computation by performing it on 3 cores.
The output is a distance matrix, which of course has a number of rows and columns equal to the number of observations. For simplicity, we only display the first 5 rows and columns.
## sequential:
## - args: function (..., envir = parent.frame())
## - tweaked: FALSE
## - call: NULL
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.0000000 0.2833059 0.2546131 0.2291734 0.5183503
## [2,] 0.2833059 0.0000000 0.5250509 0.2922206 0.5510373
## [3,] 0.2546131 0.5250509 0.0000000 0.4421927 0.4845992
## [4,] 0.2291734 0.2922206 0.4421927 0.0000000 0.5150133
## [5,] 0.5183503 0.5510373 0.4845992 0.5150133 0.0000000
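Gower's distance averages one partial dissimilarity per variable: the absolute difference divided by the variable's range for numeric variables, and a 0/1 mismatch indicator for categorical ones. The sketch below shows this for mixed records; the function name `gower_distance` and the three records are hypothetical and are not rows of the dataset.

```python
def gower_distance(a, b, ranges):
    """Gower dissimilarity between two records.

    `ranges` maps each numeric variable to its observed range;
    any variable not listed there is treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in ranges:
            # Numeric variable: absolute difference scaled to [0, 1].
            total += abs(a[name] - b[name]) / ranges[name]
        else:
            # Categorical variable: 0 if the levels match, 1 otherwise.
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)

# Hypothetical records: three component scores plus ocean proximity.
rows = [
    {"component1": 1.6, "component2": 0.5, "component3": -1.4, "ocean_proximity": "NEAR BAY"},
    {"component1": -0.7, "component2": -1.5, "component3": 0.0, "ocean_proximity": "INLAND"},
    {"component1": -0.2, "component2": 1.0, "component3": 0.7, "ocean_proximity": "NEAR BAY"},
]

numeric = ["component1", "component2", "component3"]
ranges = {v: max(r[v] for r in rows) - min(r[v] for r in rows) for v in numeric}

# Full symmetric distance matrix with zeros on the diagonal.
dist = [[gower_distance(a, b, ranges) for b in rows] for a in rows]
for line in dist:
    print([round(d, 4) for d in line])
```

Because every partial dissimilarity lies between 0 and 1, the resulting distances are also bounded by 0 and 1, as in the matrix shown above.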
Among all the linkage methods we selected Ward’s, as it provides the most balanced clusters.
We decided to cut the dendrogram at a height of 10, thus retaining 3 clusters. It would also be possible to split the data into 5 groups, but those clusters were less well defined.
##
## 1 2 3
## 1108 1736 2156
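As an illustration of how Ward's linkage builds the groups from a distance matrix, the sketch below merges clusters with the Lance-Williams update on squared distances. It is a simplified stand-in for a library routine, not the one used for the analysis: it stops at a requested number of clusters instead of cutting at a height, and the function name `ward_clusters` and the toy data are assumptions.

```python
def ward_clusters(dist, n_clusters):
    """Agglomerate with Ward's criterion until n_clusters remain.

    `dist` is a symmetric distance matrix; returns one label per observation.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}  # active cluster id -> member indices
    # Squared dissimilarities between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}

    def get(a, b):
        return d2[(a, b) if a < b else (b, a)]

    next_id = n
    while len(members) > n_clusters:
        # Merge the pair of clusters with the smallest Ward dissimilarity.
        a, b = min(d2, key=d2.get)
        na, nb = len(members[a]), len(members[b])
        updated = {}
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            # Lance-Williams update for Ward's method.
            updated[c] = ((na + nc) * get(a, c) + (nb + nc) * get(b, c)
                          - nc * get(a, b)) / (na + nb + nc)
        # Drop every pair involving the two merged clusters.
        d2 = {pair: v for pair, v in d2.items() if a not in pair and b not in pair}
        members[next_id] = members.pop(a) + members.pop(b)
        for c, v in updated.items():
            d2[(c, next_id)] = v  # next_id is larger than every existing id
        next_id += 1

    labels = [0] * n
    for label, idxs in enumerate(members.values(), start=1):
        for i in idxs:
            labels[i] = label
    return labels

# Toy example: six points on a line forming two obvious groups.
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in xs] for a in xs]
labels = ward_clusters(dist, n_clusters=2)
print(labels)
```

Cutting a dendrogram at a given height is equivalent to stopping the merges once the next merge would exceed that height.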
We can visualize the distribution of the clusters along the two main components, which are the two dimensions explaining most of the variability in the data.
The clusters are indeed quite balanced and well separated.
There are some houses displaying a particularly high value for component 1, which could be considered outliers, but for the scope of this project we decide to retain them, by assigning them to the closest cluster.
In order to see how the components behave in the three clusters, we display the density plots of their distribution.
From these plots we can observe the following:
To visualize the relation between the groups and the qualitative variable “ocean_proximity” we can display a bar chart with values expressed in percentages. In this way we can compare the distribution of the categories more effectively, given the different sizes of the clusters.
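The percentages behind such a chart are the shares of each category within each cluster. A minimal sketch, with hypothetical cluster labels and categories and an illustrative function name:

```python
from collections import Counter

def category_shares(clusters, categories):
    """Percentage of each category within each cluster."""
    counts = {}
    for cl, cat in zip(clusters, categories):
        counts.setdefault(cl, Counter())[cat] += 1
    return {
        cl: {cat: 100.0 * n / sum(c.values()) for cat, n in c.items()}
        for cl, c in counts.items()
    }

# Hypothetical assignments: cluster 1 has four houses, cluster 2 has two.
clusters = [1, 1, 1, 1, 2, 2]
categories = ["INLAND", "INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND", "NEAR BAY"]

shares = category_shares(clusters, categories)
for cl, dist_by_cat in shares.items():
    print(cl, dist_by_cat)
```

Since the shares sum to 100 within each cluster, clusters of different sizes can be compared directly.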
Thanks to the geographical coordinates we can plot our data on California’s map, and confirm our previous findings.
In order to identify the peculiarities of each group, we can compute the means of the three components and compare them according to the related concepts.
##   Group.1 component1 component2  component3
## 1       1  1.6319536  0.4899765 -1.41564165
## 2       2 -0.6951695 -1.5132952  0.02713028
## 3       3 -0.2476038  0.9933227  0.70674544
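A table of this kind is obtained by averaging each component within each group. The sketch below shows the computation on hypothetical scores; the function name `group_means` and the data are illustrative only.

```python
def group_means(rows, labels):
    """Mean of every column of `rows`, computed separately for each label."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    return {
        lab: [sum(col) / len(members) for col in zip(*members)]
        for lab, members in groups.items()
    }

# Hypothetical component scores (component1, component2, component3).
rows = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [10.0, 0.0, -2.0]]
labels = [1, 1, 2]

means = group_means(rows, labels)
for lab in sorted(means):
    print(lab, means[lab])
```

Because the components are standardized, the sign and size of each mean indicate how far a group lies from the overall average on the corresponding concept.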
Group 1:
Group 2:
Group 3: